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A  Robust  Alternative  to  the  Normal  Distribution 
D.  L.  McLeish 

Consider  some  of  the  arguments  often  advanced  for  the  use  of  the 
normal  distribution  in  a  given  situation: 

(a)  It  is  thought  that  the  variable  of  interest  can  be  represented  as 

a  sum  of  a  large  number  of  independent,  small,  and  possibly  iden- 

* 

tically  distributed  increments  (i.e.,  the  distribution  is  infinitely 
divisible). 

(b)  The  distribution  is  symmetric  and  has  all  moments  finite  (it  is 
argued  that  most  real  world  measurements  should  have  this  property). 

(c)  The  distribution  is  closed  under  convolutions;  therefore  useful  for 
estimation  and  modelling. 

(d)  The  normal  distribution  (or  a  reasonable  facsimile)  seems  to  fit 
many  real  world  phenomena. 

(e)  The  model  can  be  extended  to  allow  for  dependent  increments. 

(f)  The  distribution  is  easy  to  handle.  In  the  normal  case,  the  maximum 
likelihood  estimates  are  simple,  although  the  distribution  function 
requires  numerical  approximation.  Generation  of  random  normal  var¬ 
iates  is  easy  from  a  uniform  generator. 

On  the  other  hand,  many  arguments  have  been  made  against  the  routine 
assumption  of  normality.  Perhaps  the  most  important  of  these  is  that 
"outliers”  or  "errors"  seem  to  occur  in  otherwise  normal  samples  and  the 
normal  estimates  are  highly  non-robust  to  these.  This  observation  has 
given  rise  to  a  considerable  volume  of  literature  in  the  robust  theory 
of  estimation.  Many  of  the  arguments  and  controversies  surrounding, 
for  example,  the  choice  of  the  psi  function  for  Huber's  M-estimates  are^ 
difficult  for  many  to  appreciate  since  they  are  presented  either  with 
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only  heuristic  Justification  or  in  reference  to  a  seemingly  artificial 
contamination  model.  The  notion  that  there  are  "gross  errors"  in  an 
experimenter '8  data  is  often  one  that  is  difficult  to  account  for:  fur¬ 
thermore,  it  is  unnecessary  to  the  Justification  of  robust  procedures  as 
we  shall  see.  Finally,  although  there  is  widespread  agreement  about  the 
desirability  of  using  robust  methods,  which  such  methods  are  most  effec¬ 
tive  remains  a  controversial  question  (cf.  Stigler  (1977)).  Some  of  the 
remaining  reluctance  to  vise  robust  methods  may  be  related  to  Fisher's 
comment  (quoted  from  Wilkinson  (1979)): 

"This  example  (relating  to  the  Cauchy  distribution)  serves  also  to  illus¬ 
trate  the  practical  difficulty  which  observers  often  find,  that  a  few 
extreme  observations  appear  to  dominate  the  value  of  the  mean.  In  these 
cases  the  rejection  of  extreme  values  is  often  advocated,  and  it  may 
often  happen  that  gross  errors  are  thus  rejected.  As  a  statistical  mea¬ 
sure,  however,  the  rejection  of  observations  is  too  crude  to  be  defended: 
and  unless  there  are  other  reasons  for  rejection  than  mere  divergence 
from  the  majority,  it  would  be  more  philosophical  to  accept  these  extreme 
values,  not  as  gross  errors,  but  as  indications  that  the  distribution  of 
errors  is  not  normal.  As  we  shall  show,  the  only  Pearsonian  curve  for 
which  the  mean  is  the  beBt  statistic  for  locating  the  curve,  is  the  nor¬ 
mal  or  Gaussian  curve  of  errors.  If  the  curve  is  not  of  this  form  the 
mean  is  not  necessarily  of  any  value  whatever.  The  determination  of 
the  true  curves  for  different  types  of  work  is  therefore  of  great  prac¬ 
tical  importance...” 

The  purpose  of  this  paper  is  to  discuss  a  family  of  distributions 
that  seems  more  natural  than  the  contamination  models  (for  example,  they 
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possess  the  properties  (a)  through  (e)  above),  which  approximate  the 
normal  distribution  arbitrarily  closely,  and  yet  for  which  the  w»» 
likelihood  estimators  are  robust.  Property  (f)  is  not  enjoyed  by  this 
family  of  distributions,  since  the  simplest  form  for  the  density  func¬ 
tion  in  general  is  through  a  convergent  power  series.  However,  with 
the  rapid  decrease  in  the  cost  of  high-speed  computation,  this  defect 
is  thought  to  be  relatively  unimportant .  This  distribution  arises  quite 
naturally  in  two  different  ways,  and  in  a  related  paper  is  shown  to  pro¬ 
vide  a  good  fit  to  stock  returns.  Random  generation  of  variates  having 
this  distribution  requires  only  a  normal  and  a  gamma  generator.  The 
density  function  resembles  that  of  the  normal,  having  as  support  the 
whole  real  line,  but  the  greater  kurtosis  makes  this  useful  for  model¬ 
ling  normal-like  data  in  which  there  are  apprarent  "outliers”. 

The  density  function  resembles  that  of  the  normal;  it  is  symmetric, 
unimodal,  and  has  support  the  whole  real  line.  The  kurtosis  is  greater 
than  or  equal  to  the  normal,  making  this  distribution  useful  in  modelling 
phenomena  with  somewhat  heavier  tails  than  the  normal.  It  has  been  used, 
for  example,  by  Sichel  (.1973)  to  model  the  size  distribution  of  diamonds. 

This  family  connects  two  extreme  members,  the  normal  (for  which  the 
maximum  likelihood  estimates  are  not  robust)  and  the  Laplace  (for  which 
the  maximum  likelihood  estimate  of  location  is  the  median  and  is  highly 
robust  though  not  very  efficient  for  normal-like  distributions).  There 
are  other  ways  of  connecting  these  two  extremes.  For  example,  the  ex¬ 
ponential  power  family  (cf.  Wilkinson  (1979))  leads  to  estimates 
of  location,  1  <  p  <  2.  For  p  >  1  ,  these  are  not  robust.  Another 
family  is  proposed  by  0.  Barndorff-Nielson  (1977)  to  model  the 


distribution  of  sices  of  sand  particles.  This  distribution  reseables 
ours  in  many  ways ,  although  it  is  not  closed  under  convolutions .  It 
could,  like  the  present  family,  be  used  to  derive  robust  estimates  of 
location. 

Some  of  the  properties  of  the  present  family  are  also  obtained  by 
Teichroev  (1957). 

For  convenience,  distributions  and  their  parameters  will  be  as  de¬ 
fined  in  Johnson  and  Kotz  (1970). 

The  Density  and  Its  Properties. 

Let  G  be  a  gamma  distributed  variate  with  parameters  (a ,2)  and 
let  Z  be  a  standard  normal  variate  independent  of  G.  Then  the  den¬ 
sity  function  of  is: 

(i)  eaU>  -  -qr1 - f  y“"3/2  •'(z2/2y)')r/2  a,  . 

°  ^  ssn «)  'o 

This  density  is  finite  for  all  z  +  0  if  a  >  0  and  finite  for  all  z 
if  a  >  h.  For  a  >  *5  * 


(2) 


V0’ 


r<°  -  h)  . 

2^rl’(a) 


Ihe  modified  BeBsel  function  of  the  second  kind  is  an  even  function  that 
may  be  defined  by: 


V*) 


If  we  now  put  t 


y/2  in  the  definition  of  the  density  g 


we  obtain: 


k 


/I  l\ 

(3)  ga(z)  “  ^r(a)(~)  KaJs(z)  ‘ 

This  density  was  defined  by  Pearson,  Jeffrey,  and  Elderton  (1929)  and 
investigated  further  by  Pearson,  Stouffer  and  David  (1932).  Here  an 
asymptotic  formula  for  large  a  is  also  given.  The  distribution  is 
applied  to  testing  for  differences  between  chi-squared  values  in  con¬ 
tingency  tabled  data,  and  tables  of  the  distribution  function  for  small 
values  of  a  are  provided.  This  density  may  be  used  to  describe  the 
sample  covariance  between  independent,  identically  distributed  normal 
samples . 

It  is  not  difficult  to  show  that  this  family  of  densities  satisfies 
for  positive  a  ,  a  homogeneous  differential  equation  of  the  form 

g»(z)  .  gl&  =  A).  g£(,)  -  ga(z)  =  0  . 

This  obtains  from  the  modified  Bessel  equation  satisfied  by  Ky(z)  , 

»2  K  ♦  *  K  - +  v2)  \  -  o  . 

gQ(z)  also  satisfies  the  difference  equation: 

2 

2(5*717  Sa-i(z)  -  2a«a+l(z)  +  (2a  -  2)  *a(z)  "  0 
which  follows  from  the  equation: 
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A  more  general  family: 

■  l1  -  ‘2)“  •“  «,<*> 

may  also  be  defined.  This  is  closely  related  to  the  Bessel  function 
distribution  (cf.  McKay  (1932)  and  Laha  (195*0).  In  this  model,  how¬ 
ever,  it  is  fairly  difficult  to  disentangle  the  parameters  a, a  from 
the  location-scale  parameters.  We  will  therefore  concentrate  on  (3) 
although  many  of  its  properties  carry  over  to  its  generalization  (cf. 
Press  (1967)). 

We  will  adopt  (3)  as  the  definition  of  the  density  3ince  this  will 
allow  the  function  to  be  defined  even  for  negative  values  of  a  (al¬ 
though  in  this  case  it  is  not  a  density  function). 

We  now  introduce  location  and  scale  parameters  to  obtain  the  more 
general  family  of  densities: 

(M  fUiV.e.a)  -  |  ga(^)  • 

We  express  this  general  family  as  Be(y,6,a).  This  is  the  density  func¬ 
tion  of  a  constant  u  plus  the  product  of  a  standard  normal  variate  with 

O 

one  having  the  distribution  of  the  square  root  of  a  gamma  (a, 26  )  var¬ 
iate.  The  moment  generating  function  of  the  density  is: 

(5)  m(t;y,0,a)  -  e^d  -  eV)*0  . 
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Since  (5)  factors  into  ey * ( l-0t ) “a ( l+6t ) -Ct  ,  it  is  easily  seen  that 
this  is  also  the  density  of  V  +  Gx  -  Gg  where  Gx  and  Gg  are  in¬ 
dependent  gamma  (a, 8)  variates.  Moreover,  since 
m(t;v1«6.a1)  m(t;p2,8,a2)  =  m(t;V3,e,a3)  where  U3  »  and 

=  a1  +  ag  ,  this  distribution  is  closed  under  convolutions  (for  fixed 
scale  parameters)  and  is  therefore  infinitely  divisible.  It  therefore 
has  a  representation  such  as  that  in  property  (a)  above.  The  central 
moments  of  the  distribution  are: 

Elx-ul2^1  -  (se)2*-1  r<?> 

irT(a) 

and  in  particular,  when  2p  -  1  »  1,  2,  U  respectively,  we  obtain 
29rlg.tid  f  202a  ,  and  12e\x(a+l)  . 

/iFr(o) 

Two  important  cases  of  this  family  of  densities  deserve  mention. 
The  first  is  the  case  a  =  1  ,  when 

(6)  Sl{z)=ke'  lzl  . 

This  is  the  Laplace  distribution,  for  which  the  maximum  likelihood  es¬ 
timates  of  location  and  scale  are  sample  median  and  average  absolute 
deviation.  The  second  case  occurs  when,  for  0  a  positive  constant, 

0  approaches  0  and  o  approaches  infinity.  Specifically,  as  0  -*■  0 
f(x;y ,0 ,<j2/202)  approaches  the  normal  (p,o2)  density. 


» 


When  a  is  an  integer,  say  a  •  n  +  1,  n>0,ve  have  the 
following  representation  of  g. 


(7) 


1 

22n-k+l  * 


This  is  clearly  a  convolution  between  Poisson  probabilities  and 
those  as  a  negative  binomial  type,  i.e.. 


(8) 


PJ 


J  a  0,  1,  2,  3,  •  •  •  • 


Indeed,  if  X  has  a  Poisson  distribution  with  parameter  |z|  and  Y 
has  probabilities  specified  by  (8),  then 


g^(z)  *  P(X+Y»n) 


The  first  few  densities  of  this  type  are  easily  written  out: 


-a 


g^z)  -  He 
gg(z)  ■  H(l+  | z |  )e 


gJz)  -  (z2  +  3 1 z |  +  3)e"'z'/l6 


-  * 


S^(z)  ■  ^  ^5  +  5 [ z |  +  2z‘ 


-  * 


See  Figure  1  for  graphB  of  the  densities.  The  distribution  function  may 
be  defined  by  symmetry  and  the  relation: 


(9) 


v(k+l.x) 


P(|X|<*)  -  2  l 

k-0  n  * 


where  y (a,x)  is  the  incomplete  gamma  function  ■  J*  ta_1  e-t  dt. 
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For  the  purpose  of  estimating  location  or  scale,  it  is  often  con¬ 
venient  to  generate  the  score  function  directly.  Setting 

"a'*’  -  1 

and  using  the  difference  equation  satisfied  by  ga  , 

na+l^  =  x  na(x)  +  (2a-l) 
a  ,  ve  have  a  continued  fraction  expansion, 

for  small  a  ,  ve  may  generate  subsequent  densities 
through  the  expression 

(10)  sa+m(x)  =  (f}  ga(x)  ff^T  ^  Vk(x)  ' 

Here,  r(k)  is  the  complete  gamma  function,  y(k,«>).  This  approach 
is  often  preferable  to  the  use  of  the  series  expansion  valid  when 
v  *»  a  -  h  is  not  an  integer: 

’  SfTaT 

An  alternative  expansion  to  (T),  useful  for  large  |z|  ,  and  a 
integral  is  the  following: 


r-  ®  (-) 

Sn  v  V 

8in^~  k-0 


z  -  2k 


kl 


.  2v 


r(-v  +  k  +  i) 


r(v  +  k  +  i) 


and  for  integral 


Wx) 


Then,  given  g 
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ws m* 


T  l®f«.W»'»‘Wwwi  *te*rmmim 


eaM 


r(a)  2a 


y-1  ,  (u- i)(y - 9) 
8 | z |  2!(8z)2 


,  (R-l)(M-9)(n-g5)  , 

*  A  *  II* 

3!(8|z|)3 


o 

where  p  =  4(a  -  3g)  . 

g 

An  expression  for  the  ^  ga( z )  useful  for  maximum  likelihood 
estimation,  is  obtained  by  differentiating  (10): 


jjj  ga(z)  =  -<J/(a)  ga(z)  -  ircot(vTr)  ga(z) 

2k  r 


(11) 


_  00  (£)' 
.  r  _2_ 

2f(a)  sin(vir)  k£Q  k! 


2  v 


M-v  +  k  +  1) 

T(-v  +  k  +  1) 


(-r)  f  z2l 

+  f (v  +-k  +T)'tf(v  +  k  +  D  -  xj 


where  \J>(a)  is  the  digamma  function  ^  &n  T(o). 


Estimation  of  Parameters. 


Before  ve  present  the  maximum  likelihood  estimators  of  the  para¬ 
meters,  we  require  an  elementary  property  of  the  score  function  that  is 
based  on  the  similar  property  for  the  Bessel  function  (cf.  Abramowitz 
and  Stegun  (1964)):  zK^(z)  +  vKy(z)  *  -zKv_1(z)  for  all  v  .  Therefore, 


(12) 


gq(z)  .  Ku-l(z)  z  8q-l(z 

ga(z)  "  ”  Kv(z)  "  "  2(a-l)  ga(z) 


for  a  >  1  and  z  ^  0. 
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We  now  consider  the  problem  of  estimating  the  parameters.  We 
start  with  the  estimation  of  o.  To  begin  with,  as  z  -*■  <»  , 


gaU)  ~ 


1  0-1  -z 

- «  2  e 

r(a)2a 


» 


and 

V‘>  ,  , 

3a~  M 

-  -Wa)  +  *”  —  ■ 

Therefore,  maximum  likelihood  estimation  of  a  for  large  z  is  nearly 

2 

achieved  by  setting  the  sample  mean  of  the  variables  JLn  z^,  i  »  1,  2, 

...,  n  equal  to  their  expected  value  and  solving  for  a.  Moreover, 
since  the  shape  parameter  a  is  primarily  evident  for  a  >  3/2  by  the 
weight  in  the  tails,  and  since  true  maximum  likelihood  estimation  of  a 
is  fairly  cumbersome  computationally,  this  is  one  approach  we  used.  When 
true  maximum  likelihood  estimation  was  attempted  on  samples,  the 
iterates  often  seemed  to  fail  to  converge.  Furthermore,  we  axe  pri¬ 
marily  interested  here  in  efficient  estimation  of  y  ,  and  as  we  shall 
see,  this  depends  very  little  on  getting  an  accurate  estimate  of  a. 

One  possibility  for  the  estimator  of  parameters  is  the  simple  mo¬ 
ment  method.  For  example,  one  approach  tried  was  to  estimate  a  from 
the  relation: 

E  logU^-y)2  *  -y  +  log  02  +  \J>(a) 

where  y  is  Euler's  constant  .57721...  and  *8  the  digaoma  function. 
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This  equation  was  solved  recursively  for  a  after  initial  estimation 
of  y  and  6. 

Another  approach  that  is  feasible  though  reasonably  expensive  is 
full  maximum  likelihood  estimation  of  all  three  parameters.  While  this 
may  be  an  asymptotically  desirable  procedure,  or  useful  if  we  are  in¬ 
terested  in  accurate  estimates  of  o  ,  it  is  not  necessarily  the  best 
method  when  attention  is  focused  on  estimating  y  for  small  sample 
size.  In  fact,  some  evidence  was  obtained  that  full  MLE  had  lower 
efficiency  for  the  estimator  of  y  (n  *  20,  underlying  distribution  * 
normal)  than  the  simple  minded  scheme  used  in  the  simulation. 

The  rather  difficult  question  of  which  scheme  for  estimating  a 
leads  to  highest  efficiency  for  estimation  location  will  not  be  dealt 
with  definitively  here.  However,  after  estimating  a  ,  we  may  estimate 
y  and  0  as  follows: 

The  maximum  likelihood  equation  for  y  is: 


?  gq(zj) 

Jl  ga{ziV 


=  0  with 


or,  if  a  >  1  and  no  zi  -  0  , 


(1^) 

where  wA  «  go_1(zi)/g0f(  z^ . 
for  0  is: 


N 

l  Vi 


y  = 


i*l 


N 

l 

i®l 


Similarly,  the  maximum  likelihood  equation 


N 

l  +  *  ■  o 
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or,  if  a  >  1 


(15) 


K  *  (a-lje* 


n  N 

*  -L  T 

2N  L 
a  i-1 


Wi(Xi_P)‘ 


Now  (lU)  and  (15)  are  iterated  for  fixed  ct  until  convergence  occurs. 
When  a  is  known,  the  Fisher  information  matrix  is  given  by: 


I 


22ga_i(z)/ga(z)  dz 


0 


M 


a-1)2  +  J°°  z1<g2_1(z)/got(z)  dz , 


Some  features  of  these  equations  are  interesting  in  the  light  of  the 
theory  of  robust  estimation.  For  the  purpose  of  this  discussion,  let  us 
assume  a  >  1.5.  Note  by  (2)  that  when  a  >  .5,  ga(0)  *  r(a-^)/2/irr(a) . 
Therefore,  corresponding  to  z  -  0  ,  we  assign  weight  w^  =  (a-l)/(a-3/2) . 
Similarly,  as  z  g(z)  -  — — —  z0-1  e-z  and  so  corresponding 

01  r(a)2a 

to  large  z^  , 


w ^  ~  2(a-l)/zi  . 


This  produces  the  effect  that  the  influence  curve  is  approximately  linear 
in  the  central  region,  but  bounded  for  all  z.  This  is  true  for  any  fi¬ 
nite  a  although  the  influence  curve  for  the  normal  is  unbounded  and 
densities  in  this  family  can  approximate  arbitrarily  closely  the  normal 
density  by  taking  o  sufficiently  large. 
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Thus  maximum  likelihood  estimation  vithin  this  family  seems  to  he 

highly  robust,  but  of  course,  robustness  may  be  achieved  at  considerable 

loss  of  efficiency.  Robustness  against  obtaining  a  false  value  of  a  is 

particularly  important  here,  since  although  the  mean  y  and  the  variance 
o 

26  a  are  relatively  easy  to  estimate,  the  estimate  of  a  itself  is 

subject  to  considerable  error.  We  considered  only  the  loss  in  efficiency 

in  the  estimation  of  the  location  parameter  y  when  a  is  misspecified. 

To  take  an  extreme  case,  let  us  suppose  that  the  observations  arise  from 

2 

a  normal  distribution  (a300)  with  mean  y  and  known  variance  a  *  10. 

2  2 

Let  us  suppose,  however,  we  believe  a  ■  5  (so  the  value  of  0  is  .Iff  ). 
Then  the  asymptotic  relative  efficiency  of  the  ML  estimator  of  y  ob¬ 
tained  from  the  solution  of  (lU)  is 


a2l E  <y(Z)l2 

E  $2{Z) 

el(x)  2 

where  $(x)  *  — r—r  and  Z  is  N(0,ff  ).  Figure  2  shows  the  asymptotic 
8a'x' 

efficiency  of  the  estimator  for  a  normal  sample  when  various  other  values 
of  a  are  assumed.  Note  that  there  is  high  efficiency  for  any  value  of 
a  above  about  2  and  the  worst  value  (a  *  1  for  the  sample  median)  is 
.637.  The  efficiency  seems  to  be  much  flatter  as  a  function  of  sample 
size  than  for  many  other  robust  procedures.  Holland  and  Velsch  (1977) 
show  that  when  the  scale  parameter  is  estimated,  the  small  sample  effi¬ 
ciency  for  many  of  the  robust  procedures  seems  to  be  substantially  below 
its  asymptotic  value.  For  comparison,  we  used  a  ■  3 .46  (asymptotic 
efficiency  ■  95%)  and  Monte  Carlo  methods  on  normal  samples  of  size  10 
to  obtain  an  estimate  of  the  efficiency  for  n  ■  10  of  .94 • 


On  the  other  hand,  if  the  observations  come  from  a  distribution  of 
this  form  with  1  <_  a  <  00  ,  we  may  choose  to  use  a  ■  1.7  in  construct¬ 
ing  the  estimates.  This  provides  asymptotic  efficiency  for  location  at 
least  84.7JJ  for  all  the  members  of  this  family. 

We  now  discuss  briefly  the  asymptotic  efficiency  of  these  estimates 
for  the  Cauchy  distribution,  a  large  tailed  distribution  which  is  clearly 
not  a  member  of  this  family .  Consider  an  M-estimate  in  general,  obtained 
as  the  solution  of: 
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X  in  a  broad  neighbourhood  of  the  optimum  and  as  X  »  ,  it  approaches 

p 

8/ir.  Here,  the  choice  of  the  optimal  X  is  not  nearly  as  critical 
and  results  in  the  better  performance  of  the  estimator  for  n  *  20  in 
the  simulations  of  the  next  section.  Bell  (1980)  explores  problems  in 
adaptive  estimation  of  the  optimal  X  for  the  bi square. 

It  may  be  desired  to  build  in  more  adaptivity  for  large  tailed  dis¬ 
tributions  into  this  family.  In  fact,  the  family  could  be  broadened  to 
include  distributions  whose  tails  decay  at  a  slower  than  exponential 
rate  (and  therefore  lead  to  redescending  score  functions)  by  replacing 
the  gamma  distribution  which  multiplies  the  normal  by  a  distribution 
such  as  Fisher's  F.  Of  course,  the  resulting  distribution  will  no 
longer  be  closed  under  convolutions,  one  of  the  most  attractive  features 
of  the  present  family. 

How  much  "robustness"  is  purchased  with  the  loss  in  asymptotic  ef¬ 
ficiency  evident  In  Figure  2?  Since  the  influence  curve  is  proportional 

gi(x) 

to  the  score  function  — 7 — r  which  is  bounded,  the  estimates  are  robust. 

g(XU) 

This  function  is  also  smooth  for  a  adequately  greater  them  1.  Treat¬ 
ment  of  "outliers"  is  evident  from  the  score  function  graphed  in  Figure 
3  for  various  values  of  a.  These  functions  are  all  asymptotic  to  1 
as  x  -*■  ®>.  Consider,  for  example  the  case  a  *  2.  It  is  seen  that  the 
value  of  the  score  function  at  x  *  1  (roughly  .7  standard  deviations 
from  the  mean)  is  around  .5.  In  other  words,  in  iterating  (14),  one 
observation  at  «  is  only  the  equivalent  of  about  2  observations  at 
x  *  1.  The  degree  of  robustness  for  the  other  values  of  a  may  simi¬ 
larly  be  compared. 
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Small  Sample  Behavior. 

Andrews  et  al.  (1972)  indicate  that  adaptive  estimates  do  not  nor¬ 
mally  outperform  non-adaptive  ones  except  for  moderate  and  large  sample 
sizes.  It  is  therefore  not  expected  that  these  estimates  can  consis¬ 
tently  outperform  others  over  a  "broad  range  of  distributions.  On  the 
other  hand,  the  attractive  features  of  this  family  of  distributions  leads 
one  to  hope  that  it  does  not  suffer  from  the  lack  of  robustness  of  nor¬ 
mal  estimators,  or  from  too  great  a  lack  of  efficiency  by  comparison 
with  others.  There  is  little  doubt  that  these  estimators  will  be  com¬ 
petitive  with  the  trimmed  mean  or  Huber's  proposal  2;  the  influence 
functions  can  be  made  similar  and  the  distributions  considered  here, 
like  Huber's  least  favourable  distribution,  all  have  exponentially  de¬ 
creasing  tails.  It  was  therefore  decided  to  make  the  comparison  on 
possibly  unfavourable  ground;  with  an  M-estimate  of  redescending  type 
such  as  Tukey's  bisquare,  and  including  wide  tailed  distributions  such 
as  the  Cauchy  and  the  slash.  We  also  chose  a  relatively  small  sample 
size,  n  *  20  for  the  comparison.  Adaptive  estimates  are  usually  ex¬ 
pected  to  improve  their  performance  for  increasing  n. 

Exact  maximum  likelihood  estimation  of  all  three  parameters  is 
computationally  feasible  and  was  performed  on  several  samples  of  n*  20. 
Some  problems  are  apparent  because  of  the  very  flat  nature  of  the  like¬ 
lihood  as  a  function  of  a.  For  example,  a  significant  proportion  of 
normal  samples  (a  ■  <*>)  leads  to  MLE  of  a  constrained  to  the  in¬ 
terval  [l,*]  of  1  and  the  resulting  estimate  of  location  the  sample 
median.  This  seems  to  result  in  loss  of  efficiency  for  the  normal. 

Due  to  the  cost  of  full  MLE, it  was  impossible  to  conduct  a  simulation 
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study  of  the  performance  of  the  estimators  and  so  the  following  crude 
substitute  was  used.  It  seemed  to  provide  comparable  efficiency  for 


the  scale  parameter  was  estimated  crudely  by  matching  the  median  abso¬ 
lute  deviation  of  the  sample  with  that  of  the  assumed  distribution. 
Setting  MAD  «  medjx^ -med(X)  |  , 


6 


MAD/1. 146  when  0*2 
MAD/l.58  when  a  ■  3  • 


Finally,  the  location  estimator  was  obtained  by  a  one  step  Hewton- 
Raphson  iteration  of  equation  (lit)  starting  with  the  median. 

Comparison  is  made  with  the  mean  (normal  case)  and  the  sample  median 
for  the  other  three  distributions.  It  is  also  compared  with  Tukey's  bi- 
square,  an  M-estimator  with  n(x)  ■  x(l-x2)2  for  |x|  ■  1  and  0  other¬ 
wise.  The  value  of  X  warn  taken  to  be  1/6.UMAD  ,  a  value  that  is 


observed  by  Bell  (1980)  and  Johns  to  perform  veil  over  a  wide  range  of 
distributions.  For  larger  values  of  X  ,  the  efficiency  for  the  Cauchy 
and  the  slash  improve,  but  the  efficiency  for  the  normal  drops  sharply. 

Table  1  contains  the  results  of  a  simulation.  The  small  sacrifice 
in  terms  of  the  efficiency  for  the  slash  is  exchanged  for  an  improve¬ 
ment  on  the  Laplace,  the  normal  and  the  Cauchy.  This  is  not  meant  to 
imply  that  estimation  from  this  family  should  always  be  preferred  to 
Tukey's  bisquare.  The  strength  of  the  bisquare  rests  in  part  on  its 
treatment  of  situations  not  considered  here.  The  implication  intended 
is  that  model-based  inference  can  be  competitive  even  outside  the  family 
although  it  leaves  open  the  question  of  whether  the  appropriate  scheme 
is  maximum  likelihood. 
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TABLE  I 


SMALL  SAMPLE  EFFICIENCY  OF  ADAPTIVE  ESTIMATES  WITH  RESPECT  TO: 

(A)  THE  MEAN  OR  MEDIAN 

(B)  TUKEY'S  BISQUARE 

Sample  size:  n  *  20 


Table  gives  estimated  efficiency  in  t 


UNDERLYING  DISTRIBUTION 


NORMAL 

LAPLACE 

CAUCHY 

SLASH 

(A) 

93* 

« 

97 

94* 

• 

97 

(B) 

103.5 

108 

107 

96.1* 

Sample 

Size 

8000 

20000 

24000 

24000 

#  efficiency  relative  to  sample  mean. 

*  efficiency  relative  to  sample  median. 


Note:  The  "Princeton  swindle"  was  used  throughout,  (cf. 
Andrews  et  al. ).  I  thank  Barry  Eynon  for  running  these 
simulations  on  the  Stanford  computer,  and  the  Department 
of  Statistics,  Stanford  University,  for  providing  the 
computer  time. 


Modelling  Dependence. 

One  possible  use  for  this  family  of  distributions  is  in  modelling 
processes  that  are  wider  tailed  than  the  normal  and  at  the  same  time 
dependent.  This  may  be  needed,  for  example,  in  determining  whether  an 
estimate  is  robust  against  both  correlation  in  the  sample  and  mild  de¬ 
partures  from  normality.  Interactions  between  these  two  types  of  fai¬ 
lures  in  the  assumed  model  might  be  detected  through  Monte  Carlo  methods. 
There  are  two  rather  distinct  ways  in  which  a  process  having  marginals 

(U)  can  fail  to  be  independent.  Unlike  a  Gaussian  process,  we  may  have 

2  2 

X^,  Xj  uncorrelated  and  X^,  Xj  correlated  for  i  f  J.  Modelling  be¬ 
haviour  of  this  kind  and  testing  estimates  against  this  type  of  "second 
order"  correlation  may  be  important,  especially  since  this  is  a  type  of 
dependence  rarely  checked  for  in  practice.  This  kind  of  behaviour  is 
by  no  means  unusual.  For  example,  consecutive  changes  in  security 
prices  (on  a  log  scale)  often  seem  to  indicate  no  first  order  correla¬ 
tion  but  significant  second  order  correlation. 

We  consider  here  processes  analogous  to  the  simpler  Gaussian  time 
series  such  as  stationary  first  order  autoregressive.  For  a  process 
{X^)  with  marginals  distributed  as  Be(O,0,o)  ,  we  may  define 

as)  Vi  “Sv'f 

V 

The  coefficients  B'  are  not  constant  as  in  the  Gaussian  case,  but  are 
distributed  as  the  square  root  of  a  beta  variate  with  parameters  pa 
and  (l-p)a  where  0  <  p  <  1.  The  errors  e^  are  distributed 
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h 

as  Be(O,0,(l-p)ci)  and  the  variables  B^»  X^  ,  and  e^  are  all  inde- 
pendent.  The  correlation  function  of  a  process  defined  as  in  (18)  is 
that  of  a  stationary  first  order  autoregressive: 

(19)  ph“plh|  where  p "  mt "  HSffe+Sl  • 

Rote  that  for  a  Gaussian  process  with  correlation  function  (19) ,  we 

would  have  Cov(X^,X^+h)  *  constant  p2^  .  In  this  case*  however,  we 

have  a  slower  (exponential)  rate  of  decay  with  the  correlation  between 

and  X^+h  given  by  p^  (note  that  p  >  p2). 

A  similar  process  can  be  defined  with  replaced  by  its  negative, 

resulting  in  p  in  (19)  being  replaced  by  its  negative. 

An  alternative  method  of  generating  dependence  (which  may  be  used 

for  the  generation  of  a  continuous  time  process)  is  the  following:  Let 

Z^  be  a  stationary  (0,1)  Gaussian  sequence  with  correlation  function 

2 

i|>(h).  Let  G^  be  a  sequence  of  (possibly  dependent)  gamma  (a, 26  ) 
variables,  independent  of  the  Z^  sequence.  Define: 

(20)  Xt  -  O'?  Zt  . 

Then  the  autocovariance  function  of  X^  is: 

(21)  Cov(Xt,Xtrt>  .  *(b)KQ*  4U 

(22)  Cov(X^,X^h)  -  (l  +  2l(P2(h))E(0t0trt)  -  loV*  . 
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Note  that  when  Z^  consists  of  i.i.d.  normal  variables,  the  first 
order  correlation  (21)  is  0  but  the  second  order  (22)  is  not  necessa¬ 
rily  so.  Therefore,  this  process  may  be  used  to  model  one  such  as  the 
first  difference  in  the  logarithm  of  stock  prices,  which  exhibit  nearly 
normal  behaviour ,  have  no  apparent  correlation  of  the  first  order,  but 
show  a  tendency  for  large  values  of  |x|  to  be  followed  by  large  values 
of  |x(.  All  we  need  do  to  model  such  a  process  is  introduce  dependence 
in  the  G  sequence  with  either  a  moving  average  or  an  autoregressive 
type  relation  such  as: 


Gt+1  =  BtGt  + 


where  B. ,  G.  ,  6,  are  independent  variables  having  respectively  the  beta 
t  X  t 

(pa,  (l-p)a),  gamma  (a, 29  )  and  gamma  ((l-p)a,26  )  distributions.  This 
is  the  theme  of  (13] . 

A  more  convenient  representation  than  (l8)  is  available  when  a  is 
an  integer.  In  fact,  a  stationary  sequence  can  be  generated  by  the  usual 
first  order  autoregressive  relation: 


(23) 


t+1 


0Xt  +  et 


where  -1  <  p  <  1  and  et  is  a  random  variable  independent  of  X^ 
having  moment  generating  function: 


me(s) 


P  + 


1-P 


l-  e2s2 
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If  ve  interpret  Be(O,0,a)  as  a  point  mass  at  the  origin  whenever 

either  8  or  a  ■  0  ,  this  is  just  the  distribution  of  Be(0,6,a') 

2 

where  a’  is  distributed  binoaially  with  parameters  a  and  1  -  p  . 

Alternatively,  it  may  be  expressed  as  the  sum  of  a  random  variables 

with  distribution  Be(0,6',l)  where  0 '  =  0  or  8  with  probabili- 
2  2 

ties  p  , 1-p  .  Thus,  this  family  of  densities  is  in  Feller’s  class 
L  (Feller  (1971),  PP-  588-590). 

Similarly,  higher  order  autoregressive  schemes  are  obtainable. 
For  example,  for  a  second  order  autoregressive  process  with  distinct 
characteristic  roots  p^  and  P2  ,  both  less  than  1  in  absolute 
value,  the  distribution  of  the  errors  is  the  distribution  of  the  sum 
of  a  i.i.d.  variates  having  distribution  Be(0,8',l)  where  8'  is 
a  random  variable  assuming  three  values,  0,  p1p20  ,  and  6. 
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Figure  2 


Assumed  value  of  a 

Asymptotic  Efficiency  of  Estimate  of  Mean  of 
Normal  Sample  for  Assumed  Value  of  a 
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Figure  3 


#  Standard  deviations  away 

VALUE  3F  SCORE  VS  NUMBER  OF 
STRNOflRD  DEVIATIONS  OF 
OBSERVATION  FROM  MEAN 
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